Outline
0 Introduction
In this paper…… An overview of the purpose of your analysis, including any necessary details or references.
1 Overview
1.1 Business Problem
1.2 Business Value Proposition
2 Data Description
◦ the source
◦ the variables, with definitions or a link to a codebook
◦ the number of observations in your data set
◦ detail on missingness
◦ a glimpse of your data if possible (e.g. head and tail)
3 Data Preprocessing
◦ Feature generation
◦ Imputation
◦ Cleaning or merging of categories
◦ Outlier removal
◦ Anything that changes your data from the original form
4 Your final analysis in small pieces, with annotation
5 Graphs to visualize different steps in your analysis
6 Clear discussion of why you made analysis choices
7 References to papers or citations you used to make decisions about the analysis
Questions to ask: business value proposition, downloading data, bibliography
Variable definitions (codebook): https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names
wine=readRDS("wine.RDS")
head(wine)
library(DataExplorer)
library(kableExtra)
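# introduce() returns a one-row summary of the data set
z <- introduce(wine)
# Transpose so that each metric becomes its own row
z <- as.data.frame(t(z))
# Drop the column header so only metric names and values are shown
colnames(z) <- NULL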
knitr::kable(
z,
caption="Data Introduction"
) %>% kable_styling(bootstrap_options = c("striped", "hover"),
full_width = F,
font_size = 12,
position = "left")
| rows | 6497 |
| columns | 13 |
| discrete_columns | 0 |
| continuous_columns | 13 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 6497 |
| total_observations | 84461 |
| memory_usage | 653376 |
library(DataExplorer)
plot_histogram(wine)
library(DataExplorer)
plot_bar(wine)
summary(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.443
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.00900 Min. : 1.00 Min. : 6.0 Min. :0.9871
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0 1st Qu.:0.9923
## Median :0.04700 Median : 29.00 Median :118.0 Median :0.9949
## Mean :0.05603 Mean : 30.53 Mean :115.7 Mean :0.9947
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0 3rd Qu.:0.9970
## Max. :0.61100 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.210 Median :0.5100 Median :10.30 Median :6.000
## Mean :3.219 Mean :0.5313 Mean :10.49 Mean :5.818
## 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :9.000
## wine.type
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2461
## 3rd Qu.:0.0000
## Max. :1.0000
str(wine)
## 'data.frame': 6497 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ wine.type : num 1 1 1 1 1 1 1 1 1 1 ...
library(DataExplorer)
plot_correlation(wine, type = "c")
library(DataExplorer)
plot_missing(wine)
## There are no missing data present in this dataset
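The missingness plot can be cross-checked numerically. The following is a minimal base-R sketch, not part of the original analysis: the helper name `missing_summary` and the toy data frame are assumptions made for illustration, and with the real data one would pass `wine` instead.

```r
# Hypothetical helper: number and percentage of missing values per column.
missing_summary <- function(df) {
  n_missing <- colSums(is.na(df))
  data.frame(
    variable    = names(n_missing),
    n_missing   = as.integer(n_missing),
    pct_missing = round(100 * as.numeric(n_missing) / nrow(df), 2)
  )
}

# Toy data standing in for `wine`; the values are made up for illustration.
toy <- data.frame(
  alcohol = c(9.4, NA, 9.8, 10.5),
  pH      = c(3.51, 3.20, 3.26, 3.35),
  quality = c(5L, 5L, NA, NA)
)

miss <- missing_summary(toy)
print(miss)
```

For a data set with no missing values, such as the one reported above, every `n_missing` entry would be zero.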
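# Read the red and white wine files (semicolon-delimited)
wineRed = read.csv("winequality-red.csv",sep = ";")
wineWhite = read.csv("winequality-white.csv",sep = ";")
# Add an indicator for wine type: 1 = red, 0 = white
wineRed$wine.type <- 1
wineWhite$wine.type <- 0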
dim(wineRed)
## [1] 1599 13
dim(wineWhite)
## [1] 4898 13
# Stack the two data sets into one and treat wine type as categorical
wine = rbind(wineRed, wineWhite)
wine$wine.type <- as.factor(wine$wine.type)
dim(wine)
## [1] 6497 13
| rows | 1599 |
| columns | 13 |
| discrete_columns | 0 |
| continuous_columns | 13 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 1599 |
| total_observations | 20787 |
| memory_usage | 163576 |
| rows | 4898 |
| columns | 13 |
| discrete_columns | 0 |
| continuous_columns | 13 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 4898 |
| total_observations | 63674 |
| memory_usage | 493472 |
| rows | 6497 |
| columns | 13 |
| discrete_columns | 0 |
| continuous_columns | 13 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 6497 |
| total_observations | 84461 |
| memory_usage | 653376 |
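# Fix the seed so the 75/25 train/validation split is reproducible
set.seed(13)
# Sample 75% of the row indices without replacement for training
trainIndex = sample(1:nrow(wine), size = round(0.75*nrow(wine)), replace=FALSE)
train<-wine[trainIndex, ]
# All remaining rows form the validation set
valid<-wine[-trainIndex, ]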
nrow(train)
## [1] 4873
nrow(valid)
## [1] 1624
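The split sizes above follow from `round(0.75 * 6497) = 4873`, which leaves 1624 rows for validation. As a sanity check, the sketch below repeats the same sampling logic on a placeholder data frame and confirms that the two subsets are disjoint and have the reported sizes. The names `wine_stub`, `train_stub` and `valid_stub` are assumptions for illustration; the stub only mimics the row count of `wine`, not its contents.

```r
# Placeholder with the same number of rows as the combined wine data (6497).
# The `id` column is made up so the sketch runs without the CSV files.
wine_stub <- data.frame(id = seq_len(6497))

set.seed(13)
n_train   <- round(0.75 * nrow(wine_stub))   # round(4872.75) = 4873
train_idx <- sample(seq_len(nrow(wine_stub)), size = n_train, replace = FALSE)

# drop = FALSE keeps the one-column stub as a data frame after subsetting
train_stub <- wine_stub[train_idx, , drop = FALSE]
valid_stub <- wine_stub[-train_idx, , drop = FALSE]

# The two subsets should have the reported sizes and share no rows.
stopifnot(
  nrow(train_stub) == 4873,
  nrow(valid_stub) == 1624,
  length(intersect(train_stub$id, valid_stub$id)) == 0
)
```

Because sampling is done without replacement and the validation set is defined by negative indexing, every row lands in exactly one of the two subsets.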
step = lm(formula = quality ~ ., data = train)
summary(step)
##
## Call:
## lm(formula = quality ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5492 -0.4691 -0.0462 0.4576 2.9997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.158e+02 1.811e+01 6.394 1.77e-10 ***
## fixed.acidity 1.015e-01 1.949e-02 5.209 1.98e-07 ***
## volatile.acidity -1.528e+00 9.403e-02 -16.246 < 2e-16 ***
## citric.acid -1.636e-02 9.355e-02 -0.175 0.861173
## residual.sugar 6.754e-02 7.276e-03 9.283 < 2e-16 ***
## chlorides -3.467e-01 3.971e-01 -0.873 0.382609
## free.sulfur.dioxide 3.916e-03 8.883e-04 4.409 1.06e-05 ***
## total.sulfur.dioxide -1.260e-03 3.773e-04 -3.338 0.000849 ***
## density -1.158e+02 1.840e+01 -6.292 3.41e-10 ***
## pH 5.822e-01 1.086e-01 5.360 8.70e-08 ***
## sulphates 7.047e-01 8.903e-02 7.915 3.04e-15 ***
## alcohol 2.169e-01 2.301e-02 9.423 < 2e-16 ***
## wine.type 3.716e-01 6.813e-02 5.455 5.13e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7343 on 4860 degrees of freedom
## Multiple R-squared: 0.2993, Adjusted R-squared: 0.2975
## F-statistic: 173 on 12 and 4860 DF, p-value: < 2.2e-16
p.valid<-predict(step, newdata=valid)
head(p.valid)
## 5 7 13 15 17 20
## 4.936269 5.074472 5.372830 5.207970 6.029497 5.609911
library(caret)
RMSE(p.valid, valid$quality)
## [1] 0.7311232
R2(p.valid, valid$quality)
## [1] 0.2872661
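To make the two validation metrics concrete, here is a minimal base-R sketch of how they can be computed by hand. The function names and the toy vectors are assumptions for illustration, not part of the original analysis. As documented for caret, `R2()` defaults to the squared correlation between predictions and observations (`formula = "corr"`), which is not in general the same quantity as `1 - SSE/SST`; both variants are shown.

```r
# Hand-rolled versions of the metrics (function names are hypothetical).
rmse_manual <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Squared Pearson correlation between predictions and observations.
r2_corr <- function(pred, obs) cor(pred, obs)^2

# "Traditional" R-squared: 1 - residual sum of squares / total sum of squares.
r2_trad <- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Made-up predictions and observed quality scores, for illustration only.
obs  <- c(5, 6, 5, 7, 6, 4)
pred <- c(5.2, 5.8, 5.5, 6.4, 6.1, 4.9)

rmse_manual(pred, obs)
r2_corr(pred, obs)
r2_trad(pred, obs)
```

With the real objects, `rmse_manual(p.valid, valid$quality)` should agree with the `RMSE()` value reported above, assuming no missing predictions.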
pressure
paste("The mean pressure is:", round(mean(pressure$pressure),3), "mm")
[1] "The mean pressure is: 124.337 mm"
library(knitr)
kable(head(pressure), format="pipe", digits=3)
| temperature | pressure |
|---|---|
| 0 | 0.000 |
| 20 | 0.001 |
| 40 | 0.006 |
| 60 | 0.030 |
| 80 | 0.090 |
| 100 | 0.270 |
kable(tail(pressure), format="pipe", digits=3)
| | temperature | pressure |
|---|---|---|
| 14 | 260 | 96 |
| 15 | 280 | 157 |
| 16 | 300 | 247 |
| 17 | 320 | 376 |
| 18 | 340 | 558 |
| 19 | 360 | 806 |
You can also embed plots, for example:
For more details on organizing with tabset go here https://bookdown.org/yihui/rmarkdown-cookbook/html-tabs.html.
You can include both inline and offset equations.
You can include inline equations like \(y = nx + b\); you can also write more complicated inline equations such as \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\).